Multilingual Hybrid Text Processing in Ancient Uighur (Chaghatai) Digitalized System
Abstract
This research examines the system code page as a core technique for multilingual processing of ancient Uighur literature (hereafter Chaghatai). Based on a detailed analysis of the Arabic, Farsi and Uighur code pages in the Unicode standard, we propose a code page and keyboard layout that is compatible with Chaghatai, Arabic, Farsi, Uighur and Latin characters. This is a key technique for building specialized Chaghatai word processing systems.
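The abstract does not reproduce the proposed code page or keyboard layout, so the following is only a minimal sketch of the underlying idea: letters used for Arabic, Farsi and Uighur all sit in the Unicode Arabic block (U+0600 to U+06FF), so a mixed text can be checked code point by code point. The function names `classify`, `describe` and `is_covered` are hypothetical and are not taken from the paper.

```python
import unicodedata

# Ranges taken from the Unicode block charts; they are an assumption of this
# sketch, not the code page proposed in the paper.
ARABIC_BLOCK = range(0x0600, 0x0700)  # Arabic block, U+0600..U+06FF
BASIC_LATIN = range(0x0000, 0x0080)   # Basic Latin, U+0000..U+007F


def classify(ch: str) -> str:
    """Return a coarse label for the Unicode block of a single character."""
    cp = ord(ch)
    if cp in ARABIC_BLOCK:
        return "arabic-block"
    if cp in BASIC_LATIN:
        return "basic-latin"
    return "other"


def describe(text: str) -> list[tuple[str, str, str]]:
    """List (code point, Unicode name, block label) for every character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), classify(ch))
        for ch in text
    ]


def is_covered(text: str) -> bool:
    """True if every character falls in one of the two blocks above."""
    return all(classify(ch) != "other" for ch in text)


if __name__ == "__main__":
    # U+067E (Farsi peh), U+06C7 (Uighur u), U+0628 (Arabic beh), Latin "a"
    for row in describe("\u067E\u06C7\u0628a"):
        print(row)
```

A real Chaghatai system would need more than a block test (presentation forms, keyboard mapping, shaping), which this sketch deliberately leaves out.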
Similar resources
A Three-Step Model of Language Detection in Multilingual Ancient Texts
Ancient corpora contain various multilingual patterns. This imposes numerous problems on their manual annotation and automatic processing. We introduce a lexicon building system, called Lexicon Expander, that has an integrated language detection module, Language Detection (LD) Toolkit. The Lexicon Expander post-processes the output of the LD Toolkit which leads to the improvement of f-score and...
Rule-based Person Name Recognition for Xinjiang Minority Languages
Named entity recognition for the multiple nationalities of Xinjiang is an important part of multi-language processing. In this paper, we analyze the patterns of Uighur and Kazak person names and perform name recognition using a rule-based approach. We also propose and implement rules for Uighur and Kazak word segmentation.
Unsupervised multilingual learning
For centuries, scholars have explored the deep links among human languages. In this thesis, we present a class of probabilistic models that exploit these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these tra...
Accurate Collocation Extraction Using a Multilingual Parser
This paper focuses on the use of advanced techniques of text analysis as support for collocation extraction. A hybrid system is presented that combines statistical methods and multilingual parsing for detecting accurate collocational information from English, French, Spanish and Italian corpora. The advantage of relying on full parsing over using a traditional window method (which ignores the s...
A Multimodal Framework for the Recognition of Ancient Tamil Handwritten Characters in Palm Manuscript Using Boolean Bitmap Pattern of Image Zoning
Tamil is one of the oldest languages in the world, with a rich literature. In ancient times, writers, especially in Tamilnadu, used palm leaves to record their writing. A good example of the use of palm leaf manuscripts to preserve history is the Tamil grammar book Tolkappiyam, which was written during the 4th century B.C. The ancient literature includes many palm leaf manuscripts that contai...
Journal title:
- Journal of Chinese Language and Computing
Volume 15
Publication date: 2005